Abstract —This work provides two statistical Gaussian forecasting
methods for predicting First Daily Departure Times (FDDTs)
of everyday use electric vehicles. This is important in smart grid 
applications to understand disconnection times of such mobile 
storage units, for instance to forecast storage of non dispatchable 
loads (e.g. wind and solar power). We provide a review of the 
relevant state-of-the-art driving behavior features towards FDDT 
prediction, to then propose an approximated Gaussian method 
which qualitatively forecasts how many vehicles will depart within 
a given time frame, by assuming that departure times follow a 
normal distribution. This method considers sampling sessions as 
Poisson distributions which are superimposed to obtain a single 
approximated Gaussian model. Given the Gaussian distribution 
assumption of the departure times, we also model the problem 
with Gaussian Mixture Models (GMM), in which the preset
number of clusters represents the desired time granularity.
Evaluation has shown that, for the dataset tested, low error
and high confidence (approximately 95%) are possible for 15 and
10 minute intervals, and that GMM outperforms traditional
modeling but is less generalizable across datasets, as it is a
closer fit to the sampling data. Finally, we discuss future
possibilities and practical applications of the discussed model.

Index Terms —Times forecasting, First Daily Departure Times,
Vehicle-to-Grid integration, Gaussian modeling, Gaussian Mixture
Models, Grid load shifting

I. Introduction 

With an increasing use of Plug-in Electric Vehicles (PEVs),
mobile units can be seen as a potential grid-connected energy
storage means without compromising their primary mobility
functionality: a PEV fleet can store, for instance, power from
non-dispatchable sources (e.g. solar panels and wind
turbines) [1]. However, connections of PEVs to the grid, in
terms of times and locations, are complex to model given such
logistic mobility. This work focuses on how to meaningfully
model fleet-level departure times over the commuter time
frame 6 am - 9 am, in order to predict the availability of
PEVs as grid storage over time. Heuristic assumptions such
as over- or under-estimation of arrival/departure times both
suffer from shortcomings and will result in inefficient energy
use: an accurate forecast is therefore of paramount importance.
We exploit First Daily Departure Times (FDDT), which are
a key piece of information for connection time estimation
in PEV load shifting algorithms [?], but are hard to predict
using historical realizations alone or via basic distribution
modeling [?]. This research focused on understanding how
to accurately predict PEV FDDT for successful load shift
scheduling. The analysis began with a preliminary study of
feature correlation with FDDT (Section III-B), made possible
by a dataset with diverse driving behavior features, which
contain, for instance, information sampled from individual
drivers regarding average trip length and duration (Section
III-A). Given the lack of forecasting capability of such
features towards FDDT prediction, the research then moves to
approximated Gaussian modeling under specific a priori
assumptions (Section III-C). We provide theoretical background
(Section IV-A) for a method that computes lower and upper
bounds of PEV FDDT for each time interval (Section IV-B) and
a time interval scaling method (Section IV-C), followed by a
brief validation of these methods (Section V). Finally, we
summarize the proposed method and outline possible future
work (Section VI).


II. Related Work

Previous studies take into account aggregations of drivers’
behavior features for activity-based forecasting, aiming at
Transportation Demand Management (TDM) congestion
planning or logistic network optimality [5]. In particular,
behavior aggregation has been useful to understand the actions
that provoke inter-relations among individuals, in order to
cluster vehicle movement by activity [4].

In [6] the FDDT prediction is based on utility maximization
of the vehicle trip and activity participation. The activities
are defined as driver intentions such as “being home before
work”, while a trip is characterized by departure and arrival
times. Another method [7] uses a multilevel approach which
claims that FDDT is dependent on individual attributes such as
gender, age and profession, and on macro-level attributes such
as day of the week, location and household income. Each
attribute is modeled using normal distributions and the
prediction is based on log-likelihood maximization. However,
both these approaches require private (usually unavailable)
information about each driver (e.g. type of activities engaged
in after work). Goedel [3] provides a different approach which
takes into consideration the day of the week as a feature and a
vehicle-based analysis of commuters, in order to predict a
departure confidence interval. In other work [?], charging
profile predictions are based on stochastic analysis of the
conditional Probability Density Function (PDF) over FDDT, daily
arrival times and daily traveled distances. However, both
methods provide a one-hour interval precision of FDDT, which is
not sufficient within the domain of load shift prediction. Given
a low correlation among FDDT and driver behavior features,
the presented research focused on Gaussian modeling of FDDT
data only. Furthermore, the advantage of considering only first
daily departure times is that the research can disregard the
complexity entailed by modeling the multiple stop factor.

III. Data Understanding

We now describe the reasoning behind the adoption of the
training and test set (Section III-A), the feature correlation
analysis (Section III-B), and the set of assumptions that are
required for this statistical modeling problem (Section III-C).

A. Data Adoption

This project makes use of datasets from NREL’s Secure
Transportation Data Project [9], in particular Texas Department
of Transportation - Transportation Studies with GPS
Travel Diaries.

The main reasons for such adoption are:

• the dataset comprises many real-time features of the
trips (e.g. interval times, speeds, accelerations, statistical
measures - see Table I)

• the features present high precision and fine granularity

• all data has been electronically tracked

• given the geographical location (Texas), we assume climate
variability to be low and therefore not influencing
departure times

A major drawback of the dataset is that it does not include
labels for the day of the week; furthermore, all samplings
were performed only on Tuesdays or Wednesdays.

Feature                            Description
---------------------------------  ------------------------------------------
start_tm                           The start time of the first recorded point
                                   for the vehicle
distance_total                     Total travelled distance in miles
percent_fifty_five_sixty           Percent of total time spent at speeds
                                   between fifty five and sixty miles per hour
driving_speed_standard_deviation   Standard deviation of driving speed
                                   distribution

Table I: Listing and descriptions of examples of features
present in the NREL Transportation dataset [9].

B. Feature Analysis

Correlation among potential features and the class to predict
(FDDT) is a necessary but not sufficient condition for
pattern learning. We analyzed the potential predictive ability of
each feature by executing a Correlation-based Feature Subset
Selection [10] and a correlation-based Principal Component
Analysis (PCA) [11], making use of an Independent and
Identically Distributed (IID) assumption. Such feature filters
yielded a very low correlation between features and data,
making these unserviceable for machine learning (see Table II).

Correlation-based selected features          Correlation with FDDT (start_tm)
-------------------------------------------  --------------------------------
total_speed_velocity_ratio                   +0.15
percent_distance_fifty_five_sixty            -0.21
absolute_time_duration_hrs                   -0.3
descending_rate_median_absolute_deviation    -0.08
max_deceleration_event_duration              -0.33
average_acceleration_event_duration          +0.04
min_deceleration_event_duration              -0.05

Table II: Features with the highest correlation with FDDT
(start_tm), and with the lowest correlation among themselves.
For a description of the cited features, see [9].
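
As a rough illustration of the screening idea in this subsection, the sketch below ranks candidate features by the absolute value of their Pearson correlation with start_tm. It is a simplification and does not implement the Correlation-based Feature Subset Selection or PCA used here; the helper names (pearson, rank_features) and the toy values are hypothetical and are not taken from the NREL dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(features, target):
    """Return (name, r) pairs sorted by decreasing |r| with the target."""
    scored = [(name, pearson(col, target)) for name, col in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy values, NOT taken from the NREL dataset.
start_tm = [6.2, 6.8, 7.1, 7.4, 8.0, 8.6]  # first departure, in hours
features = {
    "distance_total": [12.0, 30.5, 8.2, 22.1, 15.3, 9.9],
    "absolute_time_duration_hrs": [1.4, 1.1, 1.3, 0.9, 1.0, 0.7],
}
ranking = rank_features(features, start_tm)
```

A feature whose |r| stays close to zero, as in Table II, offers little linear signal about start_tm, which is the motivation for modeling the departure times alone.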


C. Assumptions 

An initial intuition after viewing the variety of available
features in the dataset (Section III-A) would suggest the
possibility of performing high-dimensional regression with such
diverse components. However, this approach is not possible
with the current dataset, since features presented very low
correlation values and hence low or no learning potential (see
Table II). Therefore we proceed by assuming that every sample
is Independent and Identically Distributed (IID), i.e. that the
FDDT of a vehicle does not influence the FDDT of another.
Consequently, we do not consider the problem as a time-series
analysis as understood in literature [?].

Given such information, instead of predicting the exact
departure time of the sample, it is more convenient to forecast,
given historical values, i) how many FDDT will fall within
certain time intervals, ii) the confidences of the latter, and
iii) a system-level interval granularity itself. By empirical
analysis and by assumption we define that the FDDT sampling
undertaken for the training dataset follows a Poisson
distribution.
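
To make forecasting target i) concrete, the following minimal sketch counts how many departures of one sampling session fall into each equal-width interval of the 6 am - 9 am window. The function names, the use of minutes after midnight as the time unit, and the sample values are assumptions made for illustration only.

```python
def bin_departures(times_min, start_min=360, end_min=540, n_bins=12):
    """Count departures per equal-width interval within [start_min, end_min).

    times_min holds departure times in minutes after midnight (assumed unit);
    360 and 540 correspond to 6:00 am and 9:00 am.
    """
    width = (end_min - start_min) / n_bins
    counts = [0] * n_bins
    for t in times_min:
        if start_min <= t < end_min:  # prune to the commuter window
            counts[int((t - start_min) // width)] += 1
    return counts

def normalize(counts):
    """Turn counts into the share of vehicles departing in each bin."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Hypothetical session; the last value lies outside the window and is dropped.
session = [365, 372, 401, 430, 431, 445, 470, 500, 539, 600]
counts = bin_departures(session)
shares = normalize(counts)
```

With 12 bins each interval is 15 minutes wide; passing n_bins=18 gives the 10 minute granularity also used in the experiments.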


IV. Approximated Gaussian Modeling 


We proceed in describing a method to constrain our problem
to the Gaussian modeling domain (Sections IV-B and IV-C), given
the assumptions in Section III-C.


A. Theoretical Framework 


1) Poisson distribution: In probability theory, the discrete
Poisson distribution expresses the likelihood of a number
of events occurring sequentially and independently of each
other within a given time frame, knowing that on average a
given number λ occurs. For an in-depth description of the
mathematical properties of this distribution, we refer to [13].
We exploit the mathematical property that for a hypothetically
infinite number of samplings, the superimposition of such
Poisson distributions converges to a Gaussian (Normal)
distribution [13]. Due to the latter property, it is then
possible to model with traditional Gaussian assumptions.
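
The convergence argument can be illustrated numerically using two standard facts: independent Poisson counts add up to a Poisson count whose rate is the sum of the rates, and a Poisson distribution with a large rate λ is close to a Normal distribution with mean λ and variance λ. The sketch below is only an illustration under these facts; the session rates are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) variable at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def max_abs_gap(lam):
    """Largest pointwise gap between Poisson(lam) and Normal(lam, lam)."""
    upper = int(lam + 10 * math.sqrt(lam)) + 1
    return max(
        abs(poisson_pmf(k, lam) - normal_pdf(k, lam, math.sqrt(lam)))
        for k in range(upper)
    )

# Hypothetical mean counts of three sessions for the same time interval.
session_rates = [3.0, 4.5, 2.5]
few = max_abs_gap(sum(session_rates))        # 3 sessions superimposed
many = max_abs_gap(sum(session_rates) * 20)  # 60 sessions with the same rates
```

The gap shrinks as more sessions are superimposed, which is the behaviour the approximated Gaussian model relies on.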

2) Gaussian distribution: A Gaussian (Normal) is a continuous
probability distribution that often characterizes real-valued
random variables in applied contexts. In this context we
model the distribution over time intervals and their confidence.
The latter is possible by defining the percentage of values
captured within a distance $k\sigma$ of the mean $\mu$, where
$\sigma$ is the standard deviation, as seen in

$$P(\mu - k\sigma \le x \le \mu + k\sigma) = \int_{\mu - k\sigma}^{\mu + k\sigma} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\, dx \tag{1}$$

For k = 2 we obtain a confidence of approximately 95%. For a
deeper mathematical description we refer to [14].
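
The coverage associated with a given k has the closed form erf(k/√2) for any Normal distribution. The short sketch below evaluates it with the Python standard library; the function names are chosen here for illustration.

```python
import math

def coverage(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def interval(mu, sigma, k=2.0):
    """Lower and upper bound of the k-sigma interval around mu."""
    return (mu - k * sigma, mu + k * sigma)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

For k = 1, 2 and 3 this gives roughly 0.6827, 0.9545 and 0.9973, so k = 2 corresponds to the 95% confidence level used throughout the text.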

B. Computing Time Intervals

By aggregating the timestamp samples of a single
sampling session in discrete time intervals, we obtain a
Poisson distribution.

If we sample a sufficiently large dataset, or a heterogeneous
set of sampling sessions, the superimposition of these
Poisson distributions converges to an approximated Gaussian
distribution [?].

Let n be the index of the current sampling session and N the
total number of samplings which have been performed. Let b
be an arbitrary constant that defines the number of bins (and
therefore the time interval granularity). We then construct a
matrix K, whose entry $K_n^i$ counts the departures of sampling
session n that fall in time interval i:

$$K = \begin{pmatrix} K_1^0 & \cdots & K_1^b \\ \vdots & \ddots & \vdots \\ K_N^0 & \cdots & K_N^b \end{pmatrix} \tag{2}$$

We compute an estimation of the lower bound and the upper
bound of the number of PEV departures in the time interval
$t_i \in \{0, \dots, b\}$ with a probability of 95%:

[Eq. (3): time interval margins for $t_i$, expressed in terms of $m^i$; the expression is illegible in the source]

where:

$$m^i = \frac{1}{N} \sum_{n=1}^{N} K_n^i \tag{4}$$

C. Granularity Scaling

We want to hypothetically increase b in order to have a time
prediction interval as small as possible. We define the error
$\varepsilon_\%$ as the desired percentage error of our
confidence interval.

We obtain the finest time granularity possible without
lowering the given confidence interval by imposing that:

[Eq. (5): condition relating $\varepsilon_\%$ to $m^{min}$; the expression is illegible in the source]

where $0 < \varepsilon_\% < 1$ and:

$$m^{min} = \min(m^0, \dots, m^b) \tag{6}$$

An overview of the entire modeling discussed in Sections
IV-B and IV-C can be viewed in Algorithm 1.

Algorithm 1: PEV departure number within time interval
computation

Data:

• minDepTime, earliest departure time

• maxDepTime, latest departure time

• TD, a training set containing n sampling sessions of
departure times

• CIvalue, the percentage value of the desired confidence
interval

Result: TimeIntMargins, PEV departure number for
each time interval

begin
    b ← 0
    j ← 0
    Mmatrix ← 0
    ε% ← (1 − CIvalue)
    TDcut ← trimRange(TD, minDepTime, maxDepTime)
    while ε% < [condition on min_i(Mmatrix^i), cf. Eq. 5; illegible in the source] do
        increase b
        Kmatrix ← divideInIntervals(TDcut, b)
        Mmatrix ← imposeAndAvg(Kmatrix, b, n)
    while j < b do
        intMargins^j ← compMargins(Mmatrix^j)
        output intMargins^j
        increase j

D. Expectation-Maximization for Gaussian Mixture Models

Given the Gaussian assumption used throughout this text,
for which all first time departures follow a Normal distribution,
we made use of Gaussian Mixture Models (GMM) to cluster
FDDTs, in which each cluster represents a bin as described
in Section IV-B. We model time intervals as a Gaussian
distribution, and the time inside each interval is additionally
characterized by a Gaussian distribution. For this we use a
mixture model with K components where each component is
a d-dimensional multivariate Gaussian density:

$$g_i(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i)\right) \tag{7}$$


where $\mu_i$ is a mean, $\Sigma_i$ is a covariance matrix,
$x \in D$, where D is a given dataset, and $i = 1, \dots, K$.
To learn the parameters of the latent models characterizing
the multivariate Gaussian distribution in an unsupervised
manner, we make use of the iterative Expectation Maximization
(EM) algorithm [?], which finds maximum-likelihood parameter
estimates even with low-resolution distribution data, such as
in the case of our approximated Gaussian model derived from
superimposed Poisson distributions. For a more detailed
description of EM for GMM, we refer to [?].
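
A minimal sketch of EM for a Gaussian mixture is given below, restricted to the one-dimensional case (departure time only) and written with the standard library. It is not the implementation used for the experiments: the quantile-based initialisation, the variance floor min_var, the fixed iteration count and the sample values are all assumptions made here for illustration.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a Normal(mu, var) variable at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, k, iters=100, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances). Means start at evenly spaced
    quantiles of the data so that the result is deterministic.
    """
    xs = sorted(data)
    n = len(xs)
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    variances = [max(var_all, min_var)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs))
            variances[j] = max(spread / nj, min_var)
    return weights, means, variances

# Hypothetical departure times in minutes after 6:00 am, two loose groups.
sample = [10, 12, 15, 18, 20, 22, 95, 100, 102, 105, 108, 110]
weights, means, variances = em_gmm_1d(sample, k=2)
```

On this toy sample the two fitted means settle near the centres of the two groups. Using k equal to the number of desired time intervals mirrors the role of the clusters described above.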




















V. Validation Results and Model Usability 
Figure 1: Graphical representation of the confidence intervals
obtained with the approximated Gaussian modeling described
in Section IV-B.


Validation was not possible with the current running
assumptions, given the reduced number of samples of our
training data after time frame (6 am - 9 am) pruning. In order to
increase the number of training samples, we superimposed the
pruned sets of different cities which share similar urban and
climatic characteristics (namely Austin, San Antonio, Houston,
El Paso) to create a training set, and then used the
superimposed Gaussian model to validate on a test set formed by
samples from Rio Grande Valley. The training set for all three
following experiments contained 758 instances of data for the
time period from 6:00 am to 9:00 am, and the test set for all
experiments contained 260 instances of data for the same time
period. The set was normalized for explanatory convenience
(i.e. every bin defines the percentage of vehicles that depart
in the bin time interval). Statistical dispersion was computed
with the Gauss error function (erf) for our validating model,
which evaluates the probability that a measurement lies within
a range from −x to +x (for a zero-mean normal variable with
variance 1/2). To compute erf we used the following formula:

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\, dt \tag{8}$$
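
The integral in Eq. 8 can be checked numerically. The sketch below approximates it with Simpson's rule and compares the result with math.erf from the Python standard library; the function name erf_numeric and the step count are choices made here, not part of the original method.

```python
import math

def erf_numeric(x, steps=1000):
    """Approximate erf(x) = 2/sqrt(pi) * integral_0^x exp(-t^2) dt with Simpson's rule."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of sub-intervals
    h = x / steps
    total = 1.0 + math.exp(-x * x)  # integrand at t = 0 and t = x
    for i in range(1, steps):
        t = i * h
        total += (4 if i % 2 else 2) * math.exp(-t * t)
    return 2.0 / math.sqrt(math.pi) * total * h / 3.0

value = erf_numeric(0.0584)
reference = math.erf(0.0584)
```

Evaluated at 0.0584, the average margin of the first interval in Table III, both functions give about 0.0658, which agrees with the erf value reported for that interval.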

a) First experiment (approximated Gaussian modeling,
12 timeframes): The model derived from the theory presented
in Section IV-B was validated on all bins. Margin
computations are shown in Fig. 1, while validation results can
be seen in Fig. 2. The resulting erf for such approximated
Gaussian modeling is shown in Table III.

The implementation of Algorithm 1 focused on granularity
understanding and margin computation, in which we can
compute a trade-off between estimation confidence (via modeling
the k parameter in Eq. 1) and time interval granularity (i.e.
the number of bins).
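
The margin computation can be sketched as follows, starting from a matrix K with one row per sampling session and one column per time interval. This is only an illustration under a loudly stated assumption: the bounds are taken as the per-bin mean plus or minus k sample standard deviations across sessions, which the text does not spell out. The function names and the matrix values are hypothetical.

```python
import math

def column_means(K):
    """Average each time interval (column) over all sessions (rows)."""
    n = len(K)
    return [sum(row[i] for row in K) / n for i in range(len(K[0]))]

def margins(K, k=2.0):
    """Per-bin (lower, upper) bounds as mean -/+ k sample standard deviations.

    Assumption of this sketch: spread is measured across sessions, so at
    least two sessions are required. Lower bounds are clipped at zero.
    """
    n = len(K)
    means = column_means(K)
    out = []
    for i, m in enumerate(means):
        var = sum((row[i] - m) ** 2 for row in K) / (n - 1)
        s = math.sqrt(var)
        out.append((max(0.0, m - k * s), m + k * s))
    return out

# Hypothetical matrix: 3 sessions (rows) x 4 bins (columns), shares per bin.
K = [
    [0.10, 0.30, 0.40, 0.20],
    [0.20, 0.30, 0.30, 0.20],
    [0.15, 0.30, 0.35, 0.20],
]
M = column_means(K)
bounds = margins(K)
```

Raising k widens every interval and raises the confidence, while using more bins leaves fewer departures per bin; this is the trade-off mentioned above.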


Time intervals                          Average margin values   erf values
--------------------------------------  ---------------------   ----------
6.00-6.15am                             0.0584                  0.0658
6.15-6.30am                             0.0729                  0.0821
6.30-6.45am                             0.0756                  0.0851
6.45-7.00am                             0.0809                  0.0911
7.00-7.15am                             0.1088                  0.1223
7.15-7.30am                             0.1207                  0.1355
7.30-7.45am                             0.1300                  0.1459
7.45-8.00am                             0.1074                  0.1207
8.00-8.15am                             0.0849                  0.0956
8.15-8.30am                             0.0504                  0.0568
8.30-8.45am                             0.0570                  0.0642
8.45-9.00am                             0.0531                  0.0599
Average erf value                                               0.0938
Normalized score on a number of bins                            1.1256

Table III: Gauss error function results for the approximated
Gaussian model on average margin values (12 bins, 15 minutes
each).
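
The two summary rows of Table III can be recomputed from the erf column. The sketch below does so, on the observation that the reported normalized score is consistent with the rounded average multiplied by the number of bins (12 × 0.0938 = 1.1256); this reading is inferred from the reported figures rather than stated in the text.

```python
def summarize(erf_values):
    """Return (average erf, average multiplied by the number of bins)."""
    n_bins = len(erf_values)
    average = sum(erf_values) / n_bins
    return average, average * n_bins

# erf column of Table III (12 bins of 15 minutes each).
table_iii = [0.0658, 0.0821, 0.0851, 0.0911, 0.1223, 0.1355,
             0.1459, 0.1207, 0.0956, 0.0568, 0.0642, 0.0599]
average, score = summarize(table_iii)
```

Without intermediate rounding the average is about 0.0938 and the score about 1.125; the small difference from 1.1256 comes from rounding the average before multiplying.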


Time intervals                          Predicted values        erf values
--------------------------------------  ---------------------   ----------
6.00-6.15am                             0.0269                  0.0303
6.15-6.30am                             0.0654                  0.0737
6.30-6.45am                             0.1038                  0.1167
6.45-7.00am                             0.0692                  0.0780
7.00-7.15am                             0.1577                  0.1765
7.15-7.30am                             0.0962                  0.1082
7.30-7.45am                             0.0146                  0.0165
7.45-8.00am                             0.0135                  0.0152
8.00-8.15am                             0.0692                  0.0780
8.15-8.30am                             0.0769                  0.0866
8.30-8.45am                             0.0346                  0.0390
8.45-9.00am                             0.0192                  0.0217
Average erf value                                               0.0700
Normalized score on a number of bins                            0.8400

Table IV: Gauss error function results for GMM-model on Rio
Grande Valley values (12 bins, 15 minutes each).


b) Second experiment (Gaussian Mixture Modeling, 12
timeframes): For the EM method we used training data
containing only FDDTs (start_tm) for different vehicles as
feature and validated it on the previously mentioned test
data (Rio Grande Valley set). Each instance of the dataset is
associated with a single vehicle, and the resulting model of
the EM algorithm, shown in Fig. 3, illustrates the dependency
between departure times and the number of vehicles departing
in each time interval. We modeled 12 clusters, i.e. 12
timeframes of 15 minutes each. The model shows that the
highest number of vehicles departed at 7:00 am - 7:15 am,
which makes this interval the most probable for future
predictions under the current assumptions. The resulting erf
values are shown in Table IV. The average erf values are
0.094 and 0.07 for approximated Gaussian modeling and EM
for GMM respectively (i.e. first and second experiment).

c) Third experiment (Traditional and Gaussian Mixture
Modeling, 18 timeframes): We repeated both the EM method
for Gaussian Mixture Modeling and traditional approximated
Gaussian modeling, this time with 18 timeframes, i.e. 10
minute intervals, using only FDDTs (start_tm). Results are
visible in Fig. 4 and 5, while Gaussian error values are
displayed in Tables V and VI. Overall, given the graphical and
error results, we can confirm the Gaussian assumption on such
real data model. We can observe that approximated Gaussian
modeling preserves the form factor across departures per time
intervals, whereas GMM provides a closer approximation to the
training data.

Figure 2: The superimposed Gaussian model validated against
Rio Grande Valley ground values (12 bins, 15 minutes each).

Figure 3: EM for Gaussian Mixture Model algorithm tested on
Rio Grande Valley dataset (12 bins, 15 minutes each).

Figure 4: The superimposed Gaussian model validated against
Rio Grande Valley ground values (18 bins, 10 minutes each).

Figure 5: EM for Gaussian Mixture Model algorithm tested on
Rio Grande Valley dataset (18 bins, 10 minutes each).

VI. Conclusions and Future Work

Modeling First Daily Departure Times (FDDT) of electric
vehicles is of paramount importance for smart grid load shift
planning, as these vehicles can be used as temporary energy
storage units. By making a Gaussian distribution assumption of
such departure times, we have provided i) a traditional Gaussian
modeling approach with confidence and time interval size
modeling, and ii) a Gaussian Mixture Model approach to
compute clusters associated with time intervals. Evaluation has
shown that, for the dataset tested, low error and high confidence
(approximately 95%) are possible for 15 and 10 minute intervals.
By inspection of the normalized score on a number of bins for
both methods (Fig. ?), we notice that the GMM method is more
subject to error when increasing time interval granularity, but
requires less data to formulate the model. Future work will
be oriented towards testing the presented Gaussian model on
large datasets, implementing error propagation when relaxing
the IID assumption (i.e. assuming that all cars depart), and
considering the confidence and error trade-off for practical
applications. Furthermore, a collaboration with transport survey
research centers would be useful to gather more vehicle-related
and activity-related data, in order to cluster by the latter and
by points of interest, to then verify feature correlation with
FDDTs.